8.1 Parsing Unix timestamps

It's not obvious how to deal with Unix timestamps in pandas -- it took me quite a while to figure this out. The file we're using here is a popularity-contest file I found on my system at /var/log/popularity-contest.

Here's an explanation of how this file works.

I'm going to hope that nothing in it is sensitive :)

The columns are the access time, the created time, the package name, the most recently used program, and a tag.
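Reading it in looks roughly like this -- the path, and the detail that the first and last lines are metadata rather than data, are assumptions about your copy of the file:

```python
import pandas as pd

# A rough sketch: the path is wherever your copy of the file lives.
# The body is space-separated; read_csv consumes the one header line,
# and the trailing "END-POPULARITY-CONTEST" line is dropped with [:-1].
popcon = pd.read_csv('../data/popularity-contest', sep=' ')[:-1]
popcon.columns = ['atime', 'ctime', 'package-name', 'mru-program', 'tag']
popcon[:5]
```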

The magical part about parsing timestamps in pandas is that numpy datetimes are already stored as Unix timestamps. So all we need to do is tell pandas that these integers are actually datetimes -- it doesn't need to do any conversion at all.
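To see what that means, here's a tiny numpy example: an integer plus a unit is already a date.

```python
import numpy as np

# Each integer is interpreted as "this many seconds since 1970-01-01"
np.array([0, 1500000000], dtype='datetime64[s]')
# => array(['1970-01-01T00:00:00', '2017-07-14T02:40:00'], dtype='datetime64[s]')
```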

To start, we need to convert the atime and ctime columns to integers:
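Something like this, assuming the column names we set when reading the file in:

```python
# Make sure the timestamp columns are plain integers
popcon['atime'] = popcon['atime'].astype(int)
popcon['ctime'] = popcon['ctime'].astype(int)
```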

Every numpy array and pandas series has a dtype -- this is usually int64, float64, or object. Some of the datetime types available are datetime64[s], datetime64[ms], and datetime64[us]. There are also similar timedelta types.
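You can always check what you've got; after the astype(int) above this should print an integer dtype:

```python
popcon['atime'].dtype
# => dtype('int64')
```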

We can use the pd.to_datetime function to convert our integer timestamps into datetimes. This is a cheap operation -- there's no string parsing involved, because the integers are already more or less in the representation numpy uses internally; pandas just needs to be told how to interpret them.
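Our timestamps are in seconds, so we pass unit='s' -- roughly:

```python
# The raw numbers are seconds since the epoch, so name the unit explicitly
popcon['atime'] = pd.to_datetime(popcon['atime'], unit='s')
popcon['ctime'] = pd.to_datetime(popcon['ctime'], unit='s')
```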

If we look at the dtype now, it's <M8[ns]. As far as I can tell M8 is secret code for datetime64.
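That is:

```python
popcon['atime'].dtype
# => dtype('<M8[ns]')
```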

So now we can look at our atime and ctime as dates!
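A quick peek at the first few rows (using the column names from before):

```python
popcon[['atime', 'ctime', 'package-name']][:5]
```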

Now suppose we want to look at all packages that aren't libraries.

First, I want to get rid of everything with timestamp 0. Notice how we can just use a string in this comparison, even though it's actually a timestamp on the inside? That is because pandas is amazing.
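The filter, roughly:

```python
# Keep only rows whose access time is after the epoch
# (rows with a raw timestamp of 0 parsed to exactly 1970-01-01).
popcon = popcon[popcon['atime'] > '1970-01-01']
```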

Now we can use pandas' magical string abilities to just look at rows where the package name doesn't contain 'lib'.
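Roughly like this -- sorting by ctime at the end is just so the most recently installed packages come first:

```python
# Rows where the package name doesn't contain 'lib'
nonlibraries = popcon[~popcon['package-name'].str.contains('lib')]
# Most recently installed first
nonlibraries.sort_values('ctime', ascending=False)[:10]
```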

Okay, cool, it says that I installed ddd recently. And postgresql! I remember installing those things. Neat.

The whole message here is that if you have a timestamp in seconds or milliseconds or nanoseconds, then you can just "cast" it to a 'datetime64[the-right-thing]' and pandas/numpy will take care of the rest.
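For instance, the same trick in numpy with millisecond timestamps:

```python
# Integers interpreted as milliseconds since the epoch
np.array([1500000000000], dtype='datetime64[ms]')
# => array(['2017-07-14T02:40:00.000'], dtype='datetime64[ms]')
```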